*Crawl start date: 09 March, 2011*

*Crawl end date: 23 December, 2011*

*Number of captures: 2,713,676,341*

*Number of unique URLs captured: 2,273,840,159*

*Number of hosts captured: 29,032,069*

__Link Extraction__

Links were extracted from every HTML document in the collection and grouped into 'a href' links and all other links (embeds).
To determine how many URLs remain to be crawled, we compared the list of all extracted links with the list of URLs already crawled.
*Note: Heritrix crawl logs are not available for this collection, so we could not determine how many URLs were left uncrawled because of robots.txt policies. We plan to look into analyzing the crawled robots.txt files for each host to answer this.*
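
The grouping and comparison steps described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used for the crawl: the class `LinkExtractor` and the functions `extract_links` and `not_yet_crawled` are hypothetical names, the choice of `src` and non-anchor `href` attributes as "embeds" is an assumption, and relative-URL resolution and URL canonicalization are omitted.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects 'a href' links and embeds (all other links) separately."""

    def __init__(self):
        super().__init__()
        self.a_href = []
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "a" and name == "href":
                self.a_href.append(value)
            elif name in ("src", "href"):
                # Assumption: any other src/href attribute counts as an embed.
                self.embeds.append(value)


def extract_links(html):
    """Return (a_href_links, embed_links) found in one HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.a_href, parser.embeds


def not_yet_crawled(extracted, crawled):
    """URLs that were discovered through links but never captured."""
    return set(extracted) - set(crawled)
```

At the scale of this collection the set difference would have to be computed with a distributed or sort-based join rather than in-memory sets; the sketch only shows the logic.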

__Definitions__

**Hostgraph**: A graph of directed edges between hosts. If a page on host 'A' points to a resource on host 'B', the hostgraph has a directed edge from 'A' to 'B'.
*Only 'a href' links were considered when generating the hostgraph.*
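
As an illustration of the definition above, the following sketch collapses page-level 'a href' links into host-level edges. The function names `host_of` and `build_hostgraph` are hypothetical, and keeping same-host edges at this stage is an assumption (the definition does not say whether they are dropped here).

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased hostname of a URL, or None if it has none."""
    return urlsplit(url).hostname


def build_hostgraph(page_links):
    """page_links: iterable of (source_url, target_url) pairs taken
    from 'a href' links. Returns a set of (source_host, target_host) edges.
    """
    edges = set()
    for src, dst in page_links:
        a, b = host_of(src), host_of(dst)
        if a and b:
            # A set stores each host pair once, however many pages link.
            edges.add((a, b))
    return edges
```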

**PageRank**: An implementation of Google's PageRank algorithm, run on the hostgraph. Only hosts that also point to resources on other hosts (inter-host links) are considered. The actual number of links between a pair of hosts is ignored.
*The algorithm was run for 10 iterations, producing a list of hosts and their corresponding PageRank scores.*
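
A minimal sketch of a fixed-iteration PageRank over such a hostgraph is shown below. Only the iteration count of 10 and the unweighted, inter-host-only edges come from the description above. The damping factor of 0.85, the uniform starting scores, and the even redistribution of rank from hosts without outgoing edges are assumptions, and the function name `pagerank` is hypothetical.

```python
def pagerank(edges, iterations=10, damping=0.85):
    """edges: set of (source_host, target_host) pairs. Returns {host: score}."""
    # Keep inter-host edges only; each host pair counts once.
    edges = {(a, b) for a, b in edges if a != b}
    hosts = sorted({h for edge in edges for h in edge})
    n = len(hosts)
    out = {h: [] for h in hosts}
    for a, b in edges:
        out[a].append(b)

    rank = {h: 1.0 / n for h in hosts}  # assumed uniform start
    for _ in range(iterations):
        # Assumption: rank held by hosts with no out-edges is spread evenly.
        dangling = sum(rank[h] for h in hosts if not out[h])
        new = {h: (1.0 - damping) / n + damping * dangling / n for h in hosts}
        for a in hosts:
            if out[a]:
                share = damping * rank[a] / len(out[a])
                for b in out[a]:
                    new[b] += share
        rank = new
    return rank
```

With a fixed, small iteration count the scores may not have fully converged, so hosts with very similar scores can swap order between iterations.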

**PageRank bucket**: A list of hosts ordered by PageRank score (descending) is split into a number of equal-sized buckets. If there are 'N' hosts and 'b' buckets, each bucket contains 'N/b' hosts, with bucket '1' holding the hosts with the highest scores and bucket 'b' holding the hosts with the lowest scores.
*We created 50 PageRank buckets, each containing 275,950 hosts.*
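
The bucketing rule can be sketched as below. The function name `assign_buckets` is hypothetical, and rounding the bucket size up when 'N' is not a multiple of 'b' (so that the last bucket may be smaller) is an assumption.

```python
import math


def assign_buckets(ranked_hosts, num_buckets):
    """ranked_hosts: hosts ordered best-first.
    Returns {host: bucket}, with buckets numbered from 1.
    """
    # Assumption: round the bucket size up so every host gets a bucket.
    size = math.ceil(len(ranked_hosts) / num_buckets)
    return {host: i // size + 1 for i, host in enumerate(ranked_hosts)}
```

The same function applies to a list ordered by AlexaRank instead of PageRank score.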

**AlexaRank**: Ranked list of Alexa's Top 1 Million Hosts.
*The list was downloaded on 25 September, 2012. Of the hosts on this list, 727,314 were crawled in the 2011 Wide Crawl (a total of 584,374,573 captures from these hosts).*

**AlexaRank bucket**: A list of hosts ordered by AlexaRank (highest rank to lowest rank) is split into a number of equal-sized buckets.
*We created 50 AlexaRank buckets, each containing up to 14,547 hosts (727,314 is not an exact multiple of 50).*

**Crawl completeness**: Ratio of the number of crawled resources to the total number of discovered resources. Crawl completeness = (crawled) / (crawled + notcrawled).
Per-host crawl completeness is the same ratio restricted to one host: the number of crawled resources from that host divided by the total number of discovered resources from that host.
*For this collection, Crawl completeness = 2,273,840,159 / 22,814,673,355 = 0.0996 (9.96%)*

*For just the AlexaRank hosts, Crawl completeness = 584,374,573 / 6,447,189,768 = 0.0906 (9.06%)*
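
The ratio and its per-host variant can be sketched as follows. The function names `completeness` and `per_host_completeness` are hypothetical, and treating a host with no discovered resources as undefined (`None`) is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def completeness(crawled, not_crawled):
    """crawled / (crawled + not_crawled), or None if nothing was discovered."""
    total = crawled + not_crawled
    return crawled / total if total else None


def per_host_completeness(discovered_urls, crawled_urls):
    """Crawl completeness per host, attributing each URL to its own host."""
    crawled_urls = set(crawled_urls)
    crawled, missing = Counter(), Counter()
    for url in set(discovered_urls) | crawled_urls:
        host = urlsplit(url).hostname
        (crawled if url in crawled_urls else missing)[host] += 1
    return {h: completeness(crawled[h], missing[h])
            for h in set(crawled) | set(missing)}
```

For the collection-wide figure, `completeness(2_273_840_159, 22_814_673_355 - 2_273_840_159)` reproduces the ratio quoted above.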

**Page completeness**: Ratio of the number of crawled embeds to the total number of linked embeds, where embeds are non-'a href' links. Page completeness = (crawled embeds) / (crawled embeds + notcrawled embeds). Per-host page completeness is the ratio of the number of crawled embeds linked to by the given host to the total number of embeds linked to by the given host.
*For this collection, Page completeness = 788,632,296 / 3,934,895,311 = 0.2004 (20.04%)*

*For just the AlexaRank hosts, Page completeness = 131,854,245 / 1,012,862,487 = 0.1301 (13.01%)*
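
Unlike crawl completeness, per-host page completeness groups embeds by the host of the page that links to them, not by the host that serves them. A minimal sketch under that reading is shown below. The function name `per_host_page_completeness` is hypothetical, and counting each distinct embed URL once per linking host is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def per_host_page_completeness(embed_links, crawled_urls):
    """embed_links: iterable of (page_url, embed_url) pairs.
    Returns {linking_host: crawled_embeds / linked_embeds}.
    """
    crawled_urls = set(crawled_urls)
    embeds_by_host = defaultdict(set)
    for page_url, embed_url in embed_links:
        # Group by the host of the linking page, not of the embed itself.
        embeds_by_host[urlsplit(page_url).hostname].add(embed_url)
    return {host: len(embeds & crawled_urls) / len(embeds)
            for host, embeds in embeds_by_host.items()}
```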

